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Abstract 

The widely applicable Bayesian information criterion (WBIC) is a simple and fast 
approximation to the model evidence that has received little practical consideration. 
WBIC uses the fact that the log evidence can be written as an expectation, with respect 
to a powered posterior proportional to the likelihood raised to a power t* G (0,1), of 
the log deviance. Finding this temperature value t* is generally an intractable problem. 

We find that for a particular tractable statistical model that the mean squared error 
of an optimally-tuned version of WBIC with correct temperature t* is lower than an 
optimally-tuned version of thermodynamic integration (power posteriors). However 
in practice WBIC uses the a canonical choice of t = l/log(n). Here we investigate 
the performance of WBIC in practice, for a range of statistical models, both regular 
models and singular models such as latent variable models or those with a hierarchical 
structure for which BIC cannot provide an adequate solution. Our findings are that, 
generally WBIC performs adequately when one uses informative priors, but it can 
systematically overestimate the evidence, particularly for small sample sizes. 

Keywords and Phrases: Marginal likelihood; Evidence; Power posteriors; Widely appli¬ 
cable Bayesian information criterion. 

1 Introduction 

The Bayesian paradigm offers a principled approach to the issue of model choice, through 
examination of the model evidence, namely the probability of the data given the model. Sup¬ 
pose we are given data y and assume there is a collection of competing models, mi,..., m;, 
each with associated parameters, 9i,...,9i, respectively. Viewing the model indicators as 
parameters with prior distribution p(m*), the posterior distribution of interest is 

p(9 k , m k \y) oc f{y\9 k , m k )p(9 k \m k )p(m k ) 

1 Address for correspondence: nial.friel@ucd.ie 
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where f{y\9 k: m k ) is the likelihood of the data under model m k with parameters 9 k and 
p(6 k \m k ) is the prior on the parameters in model rn k . 

The constant of proportionality for the nn-normalised posterior distribution for model 
rrik is the marginal likelihood or evidence, 


p(y\m k ) = / f(y\9 k ,m k )p(9 k \m k )d9 k . 

Jo k 

This is a vital quantity in Bayesian model choice and developing good estimates of it con¬ 
tinues to be an active area of research in computational statistics. Henceforth, for brevity 
of notation, we will drop the dependence on m k , so that we refer to the evidence, likelihood 
and prior distribution for a given model as, p(y), f (y\9 ), p{6 ), respectively. 

There are a growing number of techniques to evaluate the evidence, see for instance, 


Gelman and Meng (1998) for a thorough review of importance, bridge and path sampling 


methods, Robert and Wraith (2009) for an updated review of such methods that includes the 


more recent mixture bridge-sampling approach (Chopin and Robert 2007), the generalised 


harmonic mean estimator (Gelfand and Dey 1994) and nested sampling (( 


perhaps, (Burrows 1980)), in addition to (Friel and Wyse 2012) who compare the accuracy 


Skilling 2006), or 


and computational burden of these methods. 

The contribution of this work is to explore a new method of approximating the evidence, 
the widely applicable Bayesian information criterion (WBIC) of Watanabe (2013). WBIC 
was motivated by the fact that the Bayesian information criterion (BIC or Schwarz criterion) 


(Schwarz 1978) is not applicable to singular models. A statistical model is termed regular 


if the mapping from model parameters to a probability distribution is one-to-one and if its 
Fisher information matrix is positive definite. Otherwise, a statistical model is singular. 
Singular models arise, for example, in latent variable models such as mixture models, hidden 
Markov models and hierarchical models such as artificial neural networks and so on. Singular 
models cannot be approximated by a normal distribution, which implies that BIC and AIC 


are not appropriate for statistical model choice. Watanabe (2013) has shown that WBIC 


converges to the model evidence, asymptotically as n —> oo, for both singular and regular 
statistical. In this sense, WBIC is generalisation of BIC. 

WBIC is straightforward to evaluate, requiring only one Markov chain Monte Carlo 
(MCMC) chain to estimate the evidence. Thus far, WBIC has received no more than a 


cursory mention by Gelman et al. (2013) and while it has been applied in practice to a 


specific reduced rank regression model, see unpublished work by Drton and Plummer (2013) 


and for the case of Gaussian process regression (Mononen 2015), beyond Watanabe’s own 
implementation there has been no further exploration of the criterion. The focus of this is 
to assess its performance as an approximation of the evidence. 

The paper is organised as follows, the key results and notation for power posteriors, 
necessary for WBIC, are presented in Section [2j Watanabe’s WBIC is presented in Section 
[3j Section [4] presents a theoretical comparison of WBIC and the power posterior approach. 
The performance of WBIC compared to several competing methods is illustrated in Section 
[5] for four examples. The article is concluded with a brief discussion in Section [6] 
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2 Power posteriors 


Friel and Pettitt (2008]) propose the method of power posteriors, a path sampling type 
method, to evaluate the marginal likelihood (or evidence) in an application of the thermo¬ 


dynamic integration technique from statistical physics. Dating to Kirkwood (1935), ther¬ 


modynamic integration has a long history in the statistical physics literature. An in-depth 
background to thermodynamic integration and Bayes free energy (also known as marginal 
likelihood) calculations for context specific statistical models is given by Cliiput and Po- 
horille (2007). In addition, the slow growth method of Bash et al. (1987) is a notable 
forerunner to the method of power posteriors. In the statistics literature the use of thermo¬ 


dynamic integration is detailed thoroughly by Neal (1993) together with other techniques 


from statistical physics and furthermore by Gelman and Meng (1998) and more recently by 


Friel and Pettitt (2008). 


As in (Friel and Pettitt 2008), for data y, parameters 9 and temperature parameter 


t E [0,1], define the power posterior as the annealed distribution 


p(9\y,t) oc f(y\0Yp(9), 


( 1 ) 


which has normalising constant defined as 

Zt(y) = [ f{y\e) t p(6)de. (2) 

Je 

Throughout we assume that p[9\y,t) is proper, so that z t (y ) exists for all t e [0,1], in 
particular, this assumes a proper prior. Clearly, the evidence is realised when t — 1, that 
is, Zi(y) = p(y) and when t = 0 the integration is over the prior with respect to 9 , thus 
Zq(v ) = 1. In what follows we make use of the power posterior identity 

log p(y) = l°g{|M} = ^ ^e\ y ,t log f(y\9) dt. (3) 

In fact, more generally one can express the log-evidence as 


log p{y) = E 0 , t \ y [hgf{y\9)/p{t)]. (4) 

for some temperature prior p(t). In practice the log-evidence is estimated, using a discretised 
temperature schedule, t £ [0,1], 0 = t 0 < fi,..., t m — 1 and MCMC draws 6 { p for i = 
1,2,... ,N from each power posterior p(9\y, tj ), as 

m (t -t A 

log p(y) ~ - o J ( Ee| 2 />b l °gf(y\9) + E 0 | log f(y\9 )). (5) 

3 = 1 

Using a burn-in of K < N iterations, E g\ yjt log f(y\9) is estimated for a fixed tj by 

1 N 

E %,q log f(y\9) ~ N _ R lo g P(y\ e f)- ( 6 ) 

j=K +1 
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Alternatively, the updated power posterior estimate of Friel et al. (2014) employ a cor¬ 
rection to the trapezoidal rule such that 


log p(y) ~ 


[tj tj- 1) 


3 = 1 


-E 

3 = 1 


(tj tj_i) 
12 


( E e\y, tj \og f{y\9) + E fl |log/(y|0)) 
( v %,% log/(2/|^) - log/( 2 /| 0 )) , 


(7) 


where Vg\ y j log f(y\9) is the variance of log f(y\9) with respect to the power posterior, 
p(9\y,t). This approximation consistently out-performs the standard estimate with no ad¬ 
ditional computation cost. Indeed recent work by Oates et al. (2016) has shown that is 


possible to achieve further improvement through the use of control variates to efficiently 
estimate Eg^t log f(y\9) and ~Ve\ y ,tj log f{y\9) for each temperature tj E [0,1], at very little 
extra computational cost. Together with the numerical integration scheme ([7]), the authors 
have shown that this can yield a dramatic improvement in the statistical efficiency of the 
estimate of the evidence. Finally we note that Hug et al. (2016) presents a further refine¬ 


ment of the power posterior approach which has shown to provide an improvement in the 
numerical integration over the temperature parameter. 


3 Widely applicable Bayesian information criterion 


The widely applicable Bayesian information criterion (WBIC) (Watanabe 2013) promises 


to reduce the considerable computational burden that the method of power posteriors and 
indeed other evidence estimation methods suffer from. The key to WBIC is that there exists 
a unique temperature, say t* E [0,1], such that, 


log p(y) = E e\ y ,t* log f(y\9). 


( 8 ) 


Hence, given this temperature t*, only one Markov chain needs to be simulated at only 
one temperature value to estimate the evidence, using samples 9^ for i — 1, 2,..., N from 
the power posterior p(9\y,t*) and equation ([6]). The fact that equation (J8| holds follows 
straightforwardly from the mean value theorem, which shows that there exists a particular 
t* such that 


\ogzi{y)-logzo(y) d 
log Ply) = -- = dt logZt ^ 


= E e \y,t* log f{y\9). 


(9) 


t* 


Uniqueness of t* follows from the fact that d 2 log z t (y)/dt 2 = ~Vg\ Vi t log f(y\9) and hence is 
strictly positive under standard regularity assumptions. 

In fact it is possible to provide an information theoretic interpretation of the optimal 
temperature t* in (|9]). Following Vitoratou and Ntzoufras (2013), it is straightforward to 
prove that pt*, the power posterior at the optimal temperature t*, is equi-distant, in terms 
of Kullback-Leibler distance, from the prior and posterior. We can show this as follows. 

Here, for brevity, we introduce the notation pt{9) = p(9\y,t ) for all t G [0,1]. 
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Lemma 3.1. The power posterior at the optimal temperature t* satisfies the identity 


KLipvWpfi) = KL(pt*\\po). 

Proof. Using the definition of Knllback-Leibler distance, we can re-write the statement of 
this lemma as, 


f Pt*{d) dd 

Je P i(0) 

J p t *(9) log pi (0) d6 

i Me)Xog m de 


[/•■ {e) ' osP fM de 

I p t *(6) log Po{9) dd 


= 0 


= 0 


log p(y) = J p t * (9) log f(y\9) d9 
= ^e\ y ,t* log/(j/|0), 


which holds from equation (Joj). 


□ 


This result may prove useful as a basis for estimating the optimal temperature. However, 
one should note that both the posterior, p\{9) and the power posterior at the optimal tem¬ 
perature, p t * are generally intractable, leaving a direct evaluation of this identity unavailable. 
Indeed we will use this result in Section [4] for an idealised analysis of a comparison of the 
mean squared error arising from the identity in ([9]) and that arising from Q. 

Clearly, finding this optimal temperature t* is a challenging task. The main contribution 
of Watanabe (2013) is to show asymptotically, as the sample size n —* oo, that t* ~ 1/ log(rt). 


WBIC is thus defined as 


WBIC = E e \ yU log f{y\e) « log p(y), (10) 


where t w — 1/ log(n). 

Watanabe (2013) introduced WBIC in the context of algebraic geometry where it is 
applied to solve the problem of singularity in the statistical models commonly encountered 
for models with latent variables or hierarchical structure including, mixture models, hidden 
Markov models, neural networks and factor regression models. In the case of singular models 
(models where the mapping from parameters to a probability distribution is not one-to- 
one and where the Fisher information matrix is not always positive definite) the standard 
Bayesian information criterion (BIC or Schwarz criterion) (Schwarz 1978) approximation to 
the marginal likelihood is known to be poor (Chickering and Heckerman 1997). As before, 
the literature is lacking a comprehensive evaluation of the performance of WBIC in both 
regular and singular settings at finite n; this is our contribution below. 
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4 A theoretical comparision of WBIC and power pos¬ 
teriors 


In this section we present a theoretical comparison of the mean squared error resulting from 
idealised implementations of both WBIC and the power posterior approach. Following Q, 
thermodynamic integration or power posteriors (PP) is based on the identity 

log p(y) = E eMy [\ogf(y\0)/p(t)]. 

Here p(t) is arbitrary, but for this comparison we suppose we have access to an an optimal 
choice (that minimises the Monte Carlo variance) and is given by 


p*(t) = argmmVg My [log(p(y\0))/p(t)] 

p(t) 

oc (E elt J(logp(y\6)) 2 ]) 1/2 


as shown in Calderhead and Girolami (2009). It has been shown in numerous studies that 


this optimal choice can be well approximated using power law heuristics. 
As before, the recently proposed WBIC, is based on the identity 


log p(y) = E e \ t , ty [\ogp(y\9)] 


where t* is the solution to KL(p f *||p 0 ) = KL(p t *||pi) where p t denotes the power posterior 
p(0\t,y). For this comparison we suppose we have access to t*. 

We note here that WBIC is not a special (or degenerate) case of PP, but there are some 
visual similarities. WBIC actually uses more information on the smoothness of the integrand, 
compared to PP, so from this simple perspective WBIC can be expected to perform better 
in principle. 

Being expressed as expectations, both identities can, in principle, be used to produce 
unbiased estimates of log p(y) via Monte Carlo. The natural theoretical question to ask is 
which Monte Carlo estimator has the lower variance. An instructive analytical analysis of 
this idealised case is provided below. 


Example 1. Consider the following simple example, taken from Friel and Pettitt (2008) 
and considered elsewhere by (Gelman et al 2013). Suppose data y — {y t : i — 1,... ,n} are 
independent and yt ~ N(0, 1). Assuming an informative prior 6 ~ N(m,v), this leads to a 
power posterior, 0\y, t ~ N(m t , vf) where 


nty + m/v 

m t = —— and v t = 
nt + 1 v 


nt + l/v 


It is straightforward to show that 


E%,tlog/(#) = log 2 tt - - ^(yi - yf 


Z— 1 


n (m — y) 2 n 1 
2 (vnt + l) 2 2 (nt + l/v)' 


( 11 ) 
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Figure 1: Normal means model. Comparison of (idealised) thermodynamic and WBIC 
estimator variances. Left: n — 1. Middle: n = 100. Right: n = 10000. In each panel we 
show the case of prior mean m G {0,1} for varying values of the prior variance v. Data y 
were generated from the model with 9 = 0. 


Moreover, it is easy to show that 


log p(y) 


n 1 v 

--log27T - - log — 
2 2 v* 



(■ny + m/v ) 2 
n + 1/v 


( 12 ) 


where v* = is the posterior variance of 9. This example is useful since not only does 

it allow analytical evaluation of 0 and but it also possible to analytically find the 

optimal temperature t* as the solution to the identity in Lemma\3fl. 


Figure [I] displays the Monte Carlo variances of the (idealised) thermodynamic and WBIC 
estimators. Data y were generated from the model with 9 = 0 where a priori, 9 ~ N(m,v ) 
and we present results for m G {0,1} and for various values of v. For this example, 


Vo\t*,y\\ogp(y\9)] < V e p y [log(p(y\9))/p*(t)], 
for all combinations of ( y,m,v ). 

For uninformative data (n small; left panel) the PP and WBIC estimators have essentially 
equal Monte Carlo variance. For informative data (n large; right panel) the PP estimator 
variance can be orders of magnitude larger than the WBIC variance at all values of the prior 
variance v. In all cases the Monte Carlo variance of PP increases relative to WBIC as the 
prior variance v goes to infinity. This reflects the fact that PP has to evaluate an expectation 
at t = 0, while the WBIC estimator deals with t* > 0 (although t* —> 0 as n —> oo). 

For this case study an optimally-tuned WBIC method has lower theoretical mean-squared 
error than an optimally-tuned PP method (again, ignoring practical details for the moment). 
The normal means model, whilst only one model, can be considered as a caricature of 
“regular” inference problems. This example thus suggests that in scenarios with either 
informative data or very vague priors, for regular models, PP can severely under-perform an 
optimally configured WBIC estimator. 
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However, there are several factors that are relevant in practice that are not captured by 
the idealised formulation above: 


1. The optimal temperature distribution p*(t) for thermodynamic integration is generally 
unavailable in closed-form. 


2. It is common to use quadrature to integrate out t in the thermodynamic integral, which 
can be more efficient than joint Monte Carlo sampling of ( 9 , t) but induces a bias into 
the estimate. 


3. The t* in WBIC is unknown and a guess of t w = l/log(n) is used in practice which 
may induce a significant error. 

4. Several extensions of the power posterior approach have been developed which have 
improved the statistical efficiency of the estimator, (Friel et al 2014) and (Oates et al 


2016). Similar development and extension may also be possible for WBIC. 


The above considerations motivate an empirical investigation of the relative merits of the 
two approaches which we now explore. 


5 Empirical examples 

We consider four examples where in all cases the motivation is to assess the performance 
of WBIC as an evidence approximation. The first model is one for which it is possible to 
calculate both log p(y) and WBIC analytically. The second model allows exact calculation 
of log p(y) only. The third model, logistic regression, is one where neither the log evidence 
nor WBIC can be evaluated exactly. The final model is a finite Gaussian mixture model, 
a singular model that WBIC was designed to handle and where neither the evidence nor 
WBIC can be evaluated exactly. In all four models, the approximation t w = l/log(n) is 
used. 


5.1 A tractable normal model 


Example 2. Consider again the tractable normal model from Section [^} Here 100 datasets 
were simulated for each of the following values of n = 50,100,1000,10000. Within each 
dataset y,; ~ 1V(0,1) and a priori, 6 ~ 1V(0,10). For each value of n, the optimal temperature 
t* was found by solving the identity in Lemma 3.1, In addition, the temperature t w = 
corresponding to WBIC was also recorded. 


The results are displayed in Figure [2|a). Clearly, as n increases, as expected, the optimal 
temperature becomes closer to the temperature corresponding to WBIC. However, for rela¬ 
tively small values of n, there is typically a large discrepancy between t w and t*. Through 
closer inspection of the curve of expected log deviances and the temperature, see Figure [5] 
for example, it is clear that overestimates of the optimal temperature will not necessarily 
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Figure 2: Tractable normal model: (a) Box-plots of the sample distribution of true tempera¬ 
ture for 100 datasets for various sizes n. Also displayed on each box-plot are the temperature 
corresponding to the WBIC estimate of the log evidence, (b) Box-plots of the difference 
between WBIC and the log evidence for 100 independent datasets with n — 50 observations 
and each with different prior variances. It is clear that the difference grows larger as the 
variance grows and that the WBIC typically overestimates the log evidence 
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translate to large overestimates of the log evidence as the curve is reasonably flat as the 
temperature approaches one. In Figure [2|b) for each value of n, the prior variance v is now 
set as 10,100,1000 with the data simulated as y ~ IV(0,1) as in (a) and it can be seen that, 
as expected, larger prior variances result in poorer WBIC estimates of the log evidence. 
Interestingly, WBIC tends to overestimate the log evidence. 

Consider now the case of unit information priors (Kass and Wasserman 1995) and con¬ 
sidered subsequently in the context of the Bayesian information criterion (BIC or Schwarz 
criterion) by Raftery (1999) and Volinsky and Raftery (2000) in sociology and survival mod¬ 
els, respectively. A unit information prior represents the amount of information contained 
in one observation of the data, such priors can be quite informative and are used here to 
illustrate the applicability of WBIC to this model. 

In the present model, correct specification of the mean of a unit information prior, here 
9 ~ N(m, 1), was vital to the performance of WBIC. Figure [3] illustrates WBIC plotted 
against log p(y) for data simulated from iV(0,1) of size n = 10000 and a prior mean of 
m — 0 or m — 1 with unit variance in either case. The WBIC approximation to log p(y) is 
particularly bad for the case where m — 1 (and this effect increases with sample size). 

Therefore the question arises as to the appropriate prior mean for a unit information prior 
in this circumstance. The data informed prior 9 ~ N(y, 1), or equivalently define y = y — y 
and the prior 9 ~ IV(0,1), is one obvious candidate for this. Though by this correction 
information about the mean is now wholly dependent on the data. 

An interesting observation can be made for fixed n, mean corrected data and the unit 
information prior 9 ~ N(0, 1), the difference between t w and t* is constant for every simulated 
dataset. That is, the simulation produces a deterministic result. Similarly, the difference 
between WBIC and log p(y) is also deterministic for every simulated dataset. 

In Figure [ij the optimal temperature t* satisfying equation ([8]) is plotted against the 
temperature, 1/ log(n), for datasets of size n = 3,4, 5,..., 50, 60, 70,..., 100000; a WBIC 
estimate with t w = 1/ log(n) is of course not suitable for n — 1,2 and t e [0,1]. The biggest 
differences occur for small n. 

It is reassuring that the method performs admirably for the case of mean corrected data 
and a unit information prior. Though again, WBIC tends to slightly overestimate the log 
evidence in this case. 


5.2 Non-nested linear regression 

Here we consider using WBIC to compute a Bayes factor and compare the results to existing 
methods to estimate the marginal likelihoods. 


Example 3. The data considered in this section describe the maximum compression strength 
parallel to the grain y t , density Xi and density adjusted for resin content Zi for n = 42 
specimens of radiata pine. Given the investigation of the tractable normal model, Sect. 5.1 


WBIC is not expected to perform particularly well with such a small sample size though. 
These data originate from (Williams 1959). It is wished to determine whether the density or 
the resin-adjusted density is a better predictor of compression strength parallel to the grain. 
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Figure 3: Tractable normal model: WBIC against log p(y) for a mis-specified prior mean. 
One hundred datasets of size n = 1000 are simulated from y ~ iV(0,1) and WBIC is evaluated 
at t — using the prior 6 ~ N(m, 1) where m is 0 or 1. Note the stark difference in the 
log evidence and WBIC for the mis-specified model. The line where WBIC = log p(y) is 
marked on the figure 
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WBIC at 1 /log(n) 
(a) 


Sample size n = 3,4,..,50,60,..,100000 

(b) 


Figure 4: Tractable normal model: In both plots the data are mean corrected and a unit 
information prior is used, that is 9 ~ 1V(0,1), and the sample size varies from n — 3 to n — 
100000. (a) For each n, the temperatures are compared between the optimal temperature 


t* , (8), plotted against t = 


l°g(n)' 


Smaller values of n incur a bigger difference in the two 


temperatures and the relationship is surprisingly regular as t becomes large and the two 
temperatures become equal. The line t = t* is included in the figure, (b) The difference 
between WBIC and log p(y) is plotted against the sample size (up to n = 3000). Even for 
relatively small n, WBIC is accurate. 
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With this in mind, two Gaussian linear regression models are considered; 


Model 1: i/i — a + (3(xi — x) + e i: e; ~ N( 0, r 1 ), 
Model 2: y, = 7 + 8{Zi - z) + y,, ty ~ N(0, A -1 ), 


( 13 ) 


for i = 1,... ,n. Under an informative set-up, the priors assumed for the line parameters 
(a,/3)' and ( 7 , 5)' had mean (3000,185)' with precision (inverse variance-covariance) tQ 0 
and XQ 0 respectively where Q 0 = diag(r 0 , s 0 ). The values of r 0 and s 0 were taken to be 0.06 
and 6, respectively. A gamma prior with shape a 0 = 6 and rate b 0 = 4 x 300 2 was taken for 
both t and A. These prior assumptions are broadly similar to the priors assumed for this 
data in other analyses. See for example (Friel and Pettitt 2008). 


It is possible to compute the exact marginal likelihood for both of these models due to 
the prior assumption that the precision on the mean of the regression line parameters is 
proportional to the error precision. For example, the marginal likelihood of Model 1 is given 
by 


p(y) = p 


= n -n/ 2 .a 0 /2 F {('U + Qp)/2} 

° r{a 0 / 2 } 

IQol 1/2 , /R \ —(n+ao)/2 

]m\u^ ( y Ry + b °> 


(14) 


where y = ( 7 / 1 ,..., y n )', M = X'X + Q 0 and R = I — XM^X' with the ith row of A" equal 
to (1 xf) and / is the 2x2 identity matrix. 

The exact value of the Bayes factor of Model 2 over Model 1 is given in Table [T| to 
show a comparison with other approaches to estimating the evidence and Bayes factor. This 


example was examined in detail in (Friel and Wyse 2012) and we refer the reader to this 


paper for precise details of how these methods were implemented. The key point to take from 
this is that WBIC is reasonably competitive with the other methods, but at a significantly 
reduced computational overhead cost. 

Figure |5] plots the expected log deviance with respect to p{0\y,t) versus the temperature 
t. A fine grid of discrete temperatures in the range [0,1] is employed and Eig\ yjt *[\og f(y\0)] 
is estimated for each t* £ [0,1] using a long MCMC run targeting the power posterior 
p(0\y,ti) The vertical line on the left hand side corresponds to the WBIC temperature 
t w = lo „.| 10) = 0.267. The vertical line on the left hand side plots the temperature (t* ~ 0.19) 
corresponding the true value of the log evidence. 

Table [2] shows the systematic bias in the estimates of the log evidence using WBIC, based 
on 20 independent MCMC runs. 

We now consider the same evidence comparisons as those made in Table [2] but under a 
unit information prior formulation. I 11 this regression model we re-parameterise the hyper¬ 
parameters as 

(a, 13)' = ( 7 , 5)' = + / 2 ' 
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Method mean(SF 2 i) (S.E.(BF 2 i)) 

Exact 


meanfi^i^i) 


Laplace approximation 
Laplace approximation MAP 
Harmonic mean estimator 
Chib & Jeliazkov’s method 
Annealed importance sampling 
Power posteriors 
Nested sampling 
WBIC 


4553.65 

— 

4553.63 

— 

4553.74 

(1.05) 

3718.57 

(909.17) 

4553.72 

( 0 . 66 ) 

4542.43 

(140.27) 

4535.11 

(74.75) 

6817.52 

(6980.82) 

4469.11 

(372.15) 


Table 1: Radiata Pine: Comparison of different approaches to estimating the Bayes factor 
of Model 2 over Model 1 based on 20 runs of each algorithm for the Radiata Pine data. 



Figure 5: Pine data: Expected log deviance vs temperature. The vertical line on the left 
shows the temperature corresponding to the true evidence. The vertical line on the right 
corresponds to the temperature at the WBIC estimate of the evidence. 


Model 1 Model 2 
True logp(y[m) —310.1283 —301.7046 

mean WBIC -308.3390 -299.8437 
s.e WBIC 0.01 0.01 


Table 2: Radiata pine data: WBIC estimate of the marginal likelihood for Model 2 and 
Model 1 compared to the true value of the log marginal likelihood for each method, based 
on 20 independent runs. 
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Method 

mean(logp(y|M 1 )) 

(S.E) 

mean(logp(y|M 2 )) 

(S.E) 

mean(5E 2 i) 

(S.E.) 

Exact 

-327.23 

B 

-324.93 

B 

9.97 

(-) 

Laplace approximation 

-327.23 

B 

-324.93 

(-) 

9.98 

(-) 

Laplace approximation MAP 

-327.24 

(0.00012) 

-324.93 

(0.000079) 

9.99 

(0.0015) 

Harmonic mean estimator 

-331.09 

(0.92) 

-328.96 

(0.98) 

21.24 

(43.67) 

Chib &; Jeliazkov’s method 

-327.23 

(0.00014) 

-324.93 

(0.000087) 

9.99 

(0.0018) 

Annealed importance sampling 

-327.23 

(0.038) 

-324.95 

(0.034) 

9.77 

(0.40) 

Power posteriors 

-327.23 

(0.017) 

-324.93 

(0.019) 

9.99 

(0.26) 

Nested sampling 

-328.26 

(1.49) 

-326.29 

(1.48) 

31.55 

(62.25) 

WBIC 

-331.04 

(0.12) 

-329.54 

(0.15) 

4.54 

(0.82) 


Tabic 3: Radiata Pine: Estimated log marginal likelihoods for each model and corresponding 
Bayes factors for each method under a unit information prior. These estimates are based on 
20 runs of the algorithm 


which is very similar to the previous values of (3000,185)’ considered above. The precision 
matrix Q 0 is now defined as 

Qo = n [(A';*,)" 1 + (A'^A'j)- 1 ] /2 « ( 35 0 28 ^ ^ 

where p\ = p 2 = 2 is the number of covariates in each model’s design matrix. The variance 
parameters A and r share hyper-parameters 

a 0 = 1, b 0 = —y'Ry = 9701264, (15) 

n 

where R = (Ri + R 2 )/ 2, R. t = I n - X % Mf 1 X- and M t = X-X t + Q 0 . 

The evidence estimates under the unit information prior parameters are presented in 
Tabic [3l The values in the table are found from 20 runs under each method. The harmonic 
mean and nested sampling estimates do not perform as poorly as in the previous set-up 
from Table [T| and the WBIC seems to be comparable with these methods, which are not 
significantly different from the true evidence values for both models. The standard error 
estimates for all models is markedly reduced from those found for the vague prior BF 2 i, 
though still unreliably high for the harmonic mean and nested sampling estimates. The 
BF 2 1 for WBIC is unsatisfactory however given the standard error estimate. Recall there 
are n — 42 observations for these data and as seen with the tractable normal model, this 
seems too small for the WBIC estimate to perform well. 

5.3 Logistic regression models 

Example 4. Here we consider the Pima Indians dataset. These data contain instances of 
diabetes and a range of possible diabetes indicators for n = 532 Pima Indian women aged 
21 years or over. There are seven potential predictors of diabetes recorded for this group; 
number of pregnancies (NP); plasma glucose concentration (PGC); diastolic blood pressure 
(BP); triceps skin fold thickness (TST); body mass index (BMI); diabetes pedigree function 
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(DP) and age (AGE). This gives 129 potential models (including a model with only a constant 
term). Diabetes incidence (y) is modelled by the likelihood 

n 

!(v\6) = X\vfa-Vi) l -' H ( 16 ) 

2=1 

where the probability of incidence for person i, p i; is related to the covariates (including 
constant term) x; = (1, Xu ,..., Xid)' and the parameters 9 = (9q, Q \,..., Qd)' by 


log 


Pi 


Pi 


= 9'xi 


(17) 


where d is the number of explanatory variables. An independent multivariate Gaussian prior 
is assumed for the elements of 9, so that 


\ rf / 2 




The covariates were standardized before analysis. 


(18) 


A long reversible jump run (Green 1995) revealed that the two models with the highest 
posterior probability were 


Model 1: logit (p) = 1 + NP + PGC + BMI + DP 

Model 2: logit (p) = 1 + NP + PGC + BMI + DP + AGE. (19) 


This reversible jump algorithm assumed a non-informative value of r = 0.01 for the prior on 
the regression parameters. For this value of r we carried out a reduced reversible jump run 
restricting to jumps only between these two models. The prior probabilities of the models 
were adjusted to allow for very frequent jumps (about 29%). This gave a Bayes factor BF 12 
of 13.96 which will be used as a benchmark to compare the other methods to. 

Table [4] displays results of estimates of the evidence for both models which were also 
implemented for this example in (Friel and Wyse 2012). Here the WBIC estimate is not as 


competitive with the more computationally demanding methods. 

Figure [ 6 ] displays a ’close-up’ plot of temperature versus expected log deviance for a 
small range of temperatures. MCMC was used to estimate the expected log-deviance at 
each powered posterior. Again, as for the previous example, the temperature t* such that 
equation Q is satisfied is smaller than 1 /logn. 

As before, we now consider the performance of the evidence estimation techniques under 
a unit information prior formulation. As the models have different numbers of parameters 
the hyper-parameters are slightly different for the two models under comparison, which, of 
course, was also necessary in the estimates presented in Table [4} The prior mean for each 
model is defined as 


0i ~ MVN ( MLE(0<), 

V n 


,-1 


i = 1 or 2 . 
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Method 

mean(logp(i/jM 1 )) 

(S.E) 

mean(Iog p(y\Mi )) 

(S.E) 

mean(BFi2) 

(S.E.) 

Relative speed 

Laplace approximation 

-257.26 

B 

-259.89 

(-) 

13.94 

B 

i 

Laplace approximation MAP 

-257.26 

(0.015) 

-259.93 

(0.026) 

14.44 

(0.50) 

108 

Harmonic mean estimator 

-279.50 

(0.65) 

-285.37 

(0.58) 

487.66 

(567.39) 

108 

Chib & Jeliazkov’s method 

-257.23 

(0.02) 

-259.86 

(0.02) 

13.89 

(0.46) 

44 

Annealed importance sampling 

-257.29 

(0.54) 

-260.38 

(0.68) 

34.04 

(35.34) 

194 

Power posteriors 

-260.91 

(0.14) 

-264.06 

(0.12) 

23.82 

(4.66) 

184 

Nested sampling 

-256.64 

(1.36) 

-260.96 

(2.39) 

75.19 

(206.80) 

808 

WBIC 

-251.49 

(0.63) 

-253.49 

(0.45) 

9.30 

(6.46) 

17 


Table 4: Pima dataset: Estimated log marginal likelihoods for each model and corresponding 
Bayes factors for each method along with relative run times with r = 0.01. The standard 
error estimates are based on 20 runs of the algorithm. 



Figure 6: Pima dataset: Expected log deviance vs temperature. The green vertical line on 
the left shows the temperature t* corresponding to the true evidence. The red vertical line 
on the right corresponds to the temperature t w at the WBIC estimate of the evidence. 
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Method 

mean(log p(y\ M 1 )) 

(S.E) 

mean (log p (j/1M 2 )) 

(S.E) 

mean(BF 12 ) 

(S.E.) 

Laplace approximation 

-244.34 

(-) 

-242.96 

(~) 

0.25 

(A 

Laplace approximation MAP 

-244.41 

(0.025) 

-243.05 

(0.036) 

0.26 

(0.013) 

Harmonic mean estimator 

-214.30 

(2.89) 

-203.66 

(2.29) 

0.0021 

(0.0082) 

Chib & Jeliazkov’s method 

-244.39 

(0.21) 

-243.23 

(0.25) 

0.33 

(0.11) 

Annealed importance sampling 

-244.34 

(0.0012) 

-242.96 

(0.0015) 

0.25 

(0.00048) 

Power posteriors 

-244.34 

(0.0041) 

-242.96 

(0.0025) 

0.25 

(0.0011) 

Nested sampling 

-244.34 

(0.0016) 

-242.96 

(0.0018) 

0.25 

(0.00055) 

WBIC 

-244.34 

(0.0011) 

-242.96 

(0.0010) 

0.25 

(0.00048) 


Table 5: Pima dataset: Log marginal likelihoods for each method and the Bayes factor 
estimates under a unit information prior. The mean and standard error values in the Table 
are based on 20 runs of the algorithm 


The evidence estimates under the unit information prior are given in Table [5] The values 
in the table are based on 20 runs of the algorithm. Ignoring the harmonic mean estimate, the 
results are quite striking. All the log evidence estimates are very similar and the standard 
errors are extremely small. One might expect that WBIC would perform well here given the 
sample size of n = 532. ffowever, Table [4] shows that WBIC is not as competitive as the 
other competing methods. However, when unit information priors are used, the results in 
Table [5] show that WBIC performs as well as all of the other methods under consideration. 

5.4 Finite Gaussian mixture model 

Watanabe introduced WBIC with the goal of approximating the evidence for singular sta¬ 
tistical models. Here we present an analysis of WBIC for one such singular model, namely 
a finite mixture model. 

Example 5. Consider now a finite mixture of K components where for i — 1,..., n and with 
nk the number of observations in the kth component n k — n ) there exist observations 

y = (■ yi ,... ,y n ). Conditioned on a set of labels z = (z ^,..., z n ) satisfying p(zi = k) = w k 
with J2k =i w k — 1 the likelihood is given by 

n R 1 / 1 \ 

p( y|hW 2 ,z) = n Yl T Ci=k} —2 exp - Ah ) 2 • ( 20 ) 

i =1 k =1 V 271,7 k V Z<J k / 

Where /j = {pi ,..., Pk), & 2 = Wf, ■ ■ ■, crfi-} and I{ Zi =k} denotes the indicator function taking 
the value 1 when Zi = k and zero, otherwise. Prior distributions are assigned to all model 
parameters as follows: 

PkWl ~ N(po, (Tq), 

a 2 k ~ Ga(a 0 ,Po), k = l,...,K 
w ~ Dir(a ,..., a) 

Zi ~ Multinomial(w), i — 1,..., n. 
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With observations in the kth component given by C k , the full-conditional distributions for 
parameters fik, , Zi are given by 


Z% k\ |/j, /ifc, &ki W k OC W k 


1 




exp 


2 *1 


(y> - 


w|y, z ~ Dir(a + n ±,..., a + rik) 
li k \y,al,z ~ N (■ m k ,s 2 k ) 


Ofc|y, /ifc, z ~ Inverse-Gamma J a 0 + rifc/2, /3 0 + E (2/» - hfc) 2 / 2 


( 21 ) 


with m k = 




2/i 


and si 




-l 


Finally, hyper-parameters are specified as, po = 0; erg = 100; «o = /^o = 1/2; a = 4. 

Here 50 datasets are simulated from a Gaussian mixture with three components such 
that /i = (—5, 0, 5) and cr 2 = c(l, 1,1). The WBIC and the power posterior approximations 
of the log-evidence are compared here; each power posterior estimate has tj = (j/(N)) 5 for 
j = 1, 2,..., N = 40, as suggested by (Friel et al 2014]). Figure [?] presents the WBIC against 
the power posterior estimates of the evidence. Again there exists a tendency for WBIC to 
overestimate the evidence, relative to power posteriors, as has been exemplified for all four 
models under consideration. 

As before the analysis was repeated under a unit information prior formulation. Results, 
not presented here, were very similar to those for the tractable normal model. 


6 Discussion 

Estimating the model evidence is well understood to be a very challenging task. Watanabe’s 
WBIC is an interesting contribution to this literature. Although motivated from statistical 
learning theory, in principle it can be applied to both regular and singular models. From an 
implcmentational point of view, it offers to provide a computationally cheap approximation 
of the evidence and this is an overwhelming advantage in favour of its use. Our theoretical 
case-study has suggested that an optimally-tuned WBIC estimator (where one has access to 
the optimal temperature t*) is likely to perform better than the power posterior approach. 
However, the empirical experiments in this paper suggest that it can provide a poor estimate 
of the evidence in practice, when t w is substituted for t*, particularly in cases where one uses 
weakly informative priors. Of course, it has been argued in the literature that specification 
of priors for statistical model selection requires careful choice. In particular, unit information 
priors are often advocated for this purpose. Our study suggests that WBIC could provide a 
useful and cost-effective approach in this case. 

In terms of future directions, an interesting question to investigate would be whether one 


may provide a useful starting point and we are currently working in this direction. 


could improve upon the default temperature, t w = l/log(n). Insights such as Lemma 3.1 
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Figure 7: Finite Gaussian mixture model: WBIC against the power posterior estimate of 
the evidence, (a) Sample of size n = 50. (b) Sample of size n = 1000. The approximation 
is performing in a materially similar matter to the tractable normal model. 
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